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Abstract 

Various regularized linear discriminant analysis (LDA) methods have been proposed 
to address the problems of the classic methods in high-dimensional settings. Asymp¬ 
totic optimality has been established for some of these methods in high dimension when 
there are only two classes. A major difficulty in proving asymptotic optimality for mul¬ 
ticlass classification is that the classification boundary is typically complicated and no 
explicit formula for classification error generally exists when the number of classes is 
greater than two. For the Fisher’s LDA, one additional difficulty is that the covariance 
matrix is also involved in the linear constraints. The main purpose of this paper is 
to establish asymptotic consistency and asymptotic optimality for our sparse Fisher’s 
LDA with thresholded linear constraints in the high-dimensional settings for arbitrary 
number of classes. To address the first difficulty above, we provide asymptotic opti¬ 
mality and the corresponding convergence rates in high-dimensional settings for a large 
family of linear classification rules with arbitrary number of classes, and apply them to 
our method. To overcome the second difficulty, we propose a thresholding approach to 
avoid the estimate of the covariance matrix. We apply the method to the classification 
problems for multivariate functional data through the wavelet transformations. 
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1 Introduction 

The linear discriminant analysis (LDA) has been a favored tool for classification in the 
settings of small p and large n. The Fisher’s discriminant analysis is one of its important 




special cases. However, these classic methods face major problems for high-dimensional data. 
In theory, Bickel and Levina (2004) and Shao et al. (2011) showed that the usual LDA can 
be as bad as the random guessing when p > n. In practice, the classic methods have poor 
predictive performance in high-dimensional settings. To address these problems, various 
regularized discriminant analysis methods have been proposed, including Friedman (1989), 
Krzanowski et al. (1995), Dudoit et al. (2001), Bickel and Levina (2004), Guo et al. (2007), 
Xu et al. (2009), Tibshirani et al. (2002), Witten and Tibshirani (2011), Clcmmensen et al. 
(2011), Shao et al. (2011), Cai and Liu (2011), Fan et al. (2012), Qi et al. (2015) and many 
others. 

Asymptotic optimality has been established in some of these papers when there are two 
classes. Shao et al. (2011) made sparsity assumptions on both the difference S = /x 2 — fa, 
where fi l and pL 2 are the population means of the two classes, and the within-class covariance 
matrix E. Then they applied thresholding procedures to both the sample estimates of S and 
S, and obtained the asymptotic optimality and the corresponding convergence rate for their 
classification rule. Cai and Liu (2011) observed that in the case of two classes, the optimal 
classification rule depends on S only through E' 1 #. Hence, they assumed l\ sparsity for 
E^ 1 ^, proposed a sparse estimate of it through minimizing its l\ norm with an l^ constraint, 
and provided asymptotic optimality of their classification rule. Fan et al. (2012) imposed 
Iq sparsity assumption on E -1 <5, estimated it through a minimization problem with an l\ 
constraint and derived the asymptotic optimality. A difficulty preventing the derivation of 
asymptotic optimality of the linear classification rules for multiple classes is that for the two- 
class classification, the classification boundary of LDA is a hyperplane and an explicit formula 
for the classification error exists, however, for the multiclass classification, the classification 
boundary is usually complicated and no explicit formula for the classification error generally 
exist. The Fisher’s LDA projects the original variables X to a low dimensional subspace 
to generate new predictor variables, Xai, Xck 2 , ..., Xa/c-i, where the coefficient vectors 
a 1? ot 2 , ■ ■ ■, otR-i satisfy the linear constraints aJUacj = 0 for any 1 < j < i < K, and 
K is the number of classes. These constraints imply that Qj is orthogonal to the subspace 
spanned by (Eqi, • • • , Eq,_i}. 

The motivation of this paper is to establish the asymptotic consistency and the asymp- 
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totic optimality of the sparse Fisher’s LDA method proposed in Qi, Luo, Carroll and Zhao 
(2015) in the high-dimensional settings for arbitrary number of classes. However, in order 
to obtain the asymptotic consistency, we revise the original method because it is hard to 
obtain a consistent estimate for a general E in the high-dimensional settings without sparsity 
or other assumptions imposed on E. Instead of aiming to estimate S, we propose a soft- 
thresholding procedure and add it into the original method to get a consistent estimate of 
the subspace (Eqi, • • • , Eq^i}. We establish the asymptotic consistency of the estimates 
of oti and the subspace {Ea l5 • • • , EcKj.!} for the revised method. To prove the asymptotic 
optimality for this method, we establish the asymptotic optimality and the corresponding 
convergence rates in high-dimensional settings for a large family of linear classification rules 
with arbitrary number of classes under the situation of multivariate normal distribution. To 
assess the real performance of the revised method, we compare it with the original method 
and other sparse LDA methods through simulation studies. The revised method has good 
predictive performance as the original method and at the same time, it enjoys nice theoretical 
properties. We also apply the revised method to the classification problems for multivariate 
functional data through the wavelet transformations. 

The rest of this paper is organized as follows. In Section 2, we introduce notations and 
briefly review the classic Fisher’s discriminant analysis. Our sparse Fisher’s LDA method 
with thresholded linear constraints is introduced in Section 3. In Section 4, we present the 
main theoretical results. Sections 5 and 6 are simulation studies and applications, respec¬ 
tively. The proofs of all theorems are provided in supplementary material. 

2 Fisher’s discriminant analysis 

We first introduce the notations used throughout the paper. For any vector v = (ui, • • • , r> p ) T , 
let || v|| !, ||v|| 2 , and Hv^ = max!<j< p \vi\ denote the A, h, and norms of v, respectively. 
For any p x p symmetric matrix M, we use A max (M), A mm (M) and A+ ifl (M) to denote the 
largest eigenvalue, the smallest eigenvalue and the smallest positive eigenvalue of M, respec¬ 
tively. Now suppose that M is symmetric and nonnegative definite. We define two norms 
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for M, 


||M|| = sup ||Mv || 2 = A mra (M), and HM^ = max \M kl \ , (2.1) 

veRP,||v|| 2 =l 1 <k,l<p 

where Mm is the (k, /)-th entry of M. The hrst norm is the usual operator norm and is also 
called the spectral norm. The second is the max norm. 

Throughout this paper, we assume that the number K of classes is fixed and can be any 
positive integer. Suppose that the population in the i-th class has a multivariate normal 
distribution N p (/x 7 , X), where is the true class mean of the i-th class, 1 < i < K, and S 
is the true common within-class covariance matrix for all classes. We assume that the prior 
probabilities for all the classes are the same and equal to 1/K. It will be seen that when 
we add a constant vector to all the observations, the classification results do not change for 
all the classification rules involved in this paper. Therefore, without loss of generality, we 
assume that the overall mean of the whole population is zero, that is, 

Mi + /D + ''' + Mic — 0. (2.2) 

Define a p x K matrix U = /x 2 , • • • , fx K \, which is the collection of class means. Under 

the assumption (2.2), the between-class covariance matrix is 

K 

B = J2»,P.J/ K = VV T /K . (2.3) 

1=1 

The Fisher’s discriminant analysis method (when the true class means and the true covari¬ 
ance matrix are known) sequentially finds linear combinations Xaq, • • • , Xa^-i by solving 
the following generalized eigenvalue problem. Suppose that we have obtained oq, • • • , <**_!, 
where 1 < i < K —■ 1, then ck,; is the solution to 

maxa T Ba, subject to a ! Sa = 1, a T Saj = 0, 1 < j < i - I- (2.4) 

The Fisher’s classification rule is to assign a new observation x to class % if 

(x - M;) T D(x - < (x - Mj) T D(x - Hj) (2-5) 

for all 1 < j ^ i < K, where D = Ylk=i a k Ci k- 
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It is well known that under our setting (that is, the population in each class has a normal 
distribution with the same covariance matrix and the prior probabilities for all classes are 
the same), the optimal classification rule is to assign a new observation x to class i if 

(x - M i ) T S^ 1 (x - < (x - /x i ) T S" 1 (x - Hj) (2-6) 


for all 1 < j 7 ^ % < K (Theorem 6.8.1 in Anderson (2003) or Theorem 13.2 in Hardle and 
Simar (2012)). Moreover, the optimal rule (2.6) is equivalent to the Fisher’s discriminant 
rule (2.5). 

In practice, the true class means and £ are unknown. Consider a training data set, 
X = {xjj : 1 < i < K, 1 < j < rii}, where x y - is the jth observation from the ith class and 
n % is the number of the observations of the ith class. The numbers (ni, ri 2 , • • • ,tik) can be 
either random or nonrandom. Let n = Ylf=i n i be the total number of observations in the 
data. Throughout this paper, we use 


x* 

B 


-.Hi , K rii K rii 

— Y x b’ *=~YY x b’ ^ = ——K Y Y&i - x ' )(x b - X *) T ’ 

3 =1 2=1 3 = 1 2=1 J = 1 


- Y n *( x i - x ) (xj - x) T , 1 < i < K, 

n < 


to denote the sample class means, the sample overall mean, the sample within-class co- 
variance matrix and the sample between-class covariance matrix, respectively. Then the 
classic Fisher’s discriminant analysis is to sequentially obtain the estimates Si, • • •, ck^-i 
of ay, • • • , olr -1 by solving 


max a T Ba, 


subject to ck t £ck = 1 , e* T £ciq = 0 , 1 < j < i, 


( 2 . 8 ) 


where 1 < i < K — 1. The classification rule is to assign a new observation x to class i if 

(x - X;) t D(x - Xj) < (x - Xj) t D(x - Xj), (2.9) 

for all 1 < j ^ i < K, where D = Ylk=i • 
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3 Sparse Fisher’s discriminant analysis with thresh- 


olded linear constraints 

In the high-dimensional setting, the classic Fisher’s discriminant analysis has several draw¬ 
backs. First, X is not full rank, so the solution to (2.8) does not exist. Second, B and X 
as given in (2.7) are not consistent estimates in terms of the operator norm. Hence, the 
estimates of a*,, 1 < k < K — 1, obtained by classic Fisher’s discriminant analysis are not 
consistent. Third, suppose that we have obtained an estimate ay of ay, in order to estimate 
a 2 , we have to estimate the coefficient vector of the linear constraint in (2.4), Xay. However, 
even if ay is consistent, Xay is not a consistent estimate of Xo! due to the inconsistency 
of X. To address these drawbacks, we describe a revised method of the sparse Fisher’s 
discriminant analysis in Qi et al. (2015). 


3.1 The case of K = 2 

When there are two classes, there is only one component ai and B = + // 2 A t J)/2 = 

HifiJ because = — [i 2 - It is easily seen that ol\ = X _1 (5/\/5 T X -1 5, where S = fi 2 ~ AH- 
Cai and Liu (2011) and Fan et al. (2012) imposed l\ and l 0 sparsity assumptions on X^ 1 ^, 
respectively. Equivalently, we assume that a x is sparse in terms of li norm as in Cai and Liu 
(2011). In the case of K = 2, it is not necessary to revise the original method in Qi et al. 
(2015). The estimate ay of a! is the solution to 


max affia, subject to a 1 Xa + 'rllall \ = 1, (3.1) 

aelP 


where ||a||| = (1 — A)||a||| + A||a||f and both r > 0 and 0 < A < 1 are tuning parameters. 
The introduction of ||a||| overcomes the issue that X is not full rank in high-dimensional 
setting, and the term ||a||f encourages the sparsity of the solution. A difference between our 
penalty and the usual lasso or elastic-net penalty is that we use the squared h-norrn. This 
particular form of our penalty leads to the property that ay is also the solution to 


a [ Ba 

max -—- 

c*eRp,f*^o o; T Xa + rlla 


(3.2) 
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where the objective function is scale-invariant. That is, for any nonzero number t, the 
vector toL\ is also a solution to (3.2). This scale-invariant property is intensively used in 
our theoretical development. Once we obtain op, our classification rule is to assign a new 
observation x to class i if (x — x*) T D(x — 5q) < (x — Xj) T D(x — 5tj) for 1 < j ^ i < 2, where 

D = a i a! . 


3.2 The case of K > 2 


If K > 2, more than one components need to be estimated, op is estimated in the same way 
as K = 2. Since the higher order component oti, 1 < i < K — 1, satisfies the constraints 
in (2.4), a.i is actually orthogonal to the subspace spanned by {Sap,-- - , Xaj_x} in R p . 
Because a, is the eigenvector of the generalized eigenvalue problem (2.4), for any 1 < j < 
K — 1, Bor,- and Xa.,- have the same directions and only differ by a scale factor, which is the 
j’-tli eigenvalue. Hence, the subspace spanned by {Bap, • • • , Baj_x} is the same as that of 
{Sai, • • • , 

Because in the high-dimensional settings, X and B are not consistent estimates of X 
and B in terms of the operator norm, respectively, neither of the subspaces spanned by 
{Xaq, • • • , Xq,_!} and (Bax, • • • , Ba,_x} is a consistent estimate of the subspace spanned 
by {Xax,--- , Xa^x} (or by {Bax, • • • ,Ba,_x}), even if otj, 1 < j < i — 1, are consis¬ 
tent estimates. Therefore, in order to estimate these subspaces, in addition to the spar¬ 
sity assumption on (ax, • • * ,aA-_x}, we also make sparsity assumptions on the vectors, 
Xax, • • • , Xa/i'_x, in terms of l\ norm. Lemma 2 in Section 4 shows that making sparsity 
assumptions on Xax, • • • , Xa^-i is equivalent to or weaker than assuming the sparsity of 
{Hi — fj,j , 1 < i j < K } in terms of l\ norm. The latter assumption has been made in Shao 
et al. (2011). Bickcl and Levina (2004) assumes that n 1 and /x 2 are sparse when K = 2, 
which implies that /ij — /x 2 is sparse. 

Linder the above assumptions, suppose that we have obtained the estimate aq of a 2 , 
1 < j < i — 1, then we obtain the estimate $, j of B ctj as the solution to 


min 


£ 



+ «ll*lli 


(3.3) 
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where k > 0 is a tuning parameter. It can be shown that the Z-th coordinate of £ • is 


(€ 7 -)i = sign((Ba j 


Ba,)j| - /c/2 


I 


[|(Ba J ) l |>«/2]> 


1 <l<p, 


(3.4) 


where I[|(ga-)i|>«/ 2 ] * s ^ ie indicator function of [|(Baj)/| > /c/2]. So we actually estimate 
Bttj by applying the soft-thresholding to Boy. We will show that the subspace spanned 
by {£i, • • •, £j_i} is a consistent estimate of the subspace spanned by {Bap, • • • , Bq^i} 
and provide the convergence rate in Section 4. An alternative way to obtain a consistent 
estimate of the subspace is to apply the soft-threholding to Eap, • • • , Sqj_i. However, it 
turns out that the real predictive performance of this alternative is inferior to the proposed, 
so we do not consider it in this paper. Now suppose that we have obtained the estimates 
Si, • • •, OLi-i and £ 1; • • •, £j_i, then a* is the solution to 


max a x Ba, subject to a T Ea + r||a||l( = 1, a T ^ ; = 0, j < i. (3.5) 

The optimization problems (3.1) and (3.5) are both special cases of the following general 
problem: 


max a T IIa, subject to a T Ca + r||a||{ < 1, La = 0, (3.6) 

where II and C are any two p x p nonnegative definite symmetric matrices, and L is either 
equal to zero or any matrix with p columns. La = 0 can be viewed as linear constraints 
imposed on a. For example, (3.5) is the special case of (3.6) with II = B, C = X and 
L = (£i, • • • , £ i _ 1 ) T . In Qi et al. (2015), we solve (3.6) by the following algorithm. 

Algorithm 3.1. 

1. Choose an initial vector a ( °i with IIa (( b ^ 0. 

2. Iteratively compute a sequence a^), aS 2 \ • • • , a*®), • • • until convergence as follows: for 
any i > 1 , compute ah) by solving 

mmc(IIa < '®“ 1 ^) T a, subject to a T Ca + r||a||^ < 1, La = 0. (3.7) 



The key step (3.7) of Algorithm 3.1 is a special case of the following problem with 
c = naf’" 1 ): 

maxc T a, subject to a T Ca + r||a||^ < 1, La = 0, (3.8) 

OL 

where c is any nonzero vector. The algorithm and the related theory to solve (3.8) have 
been developed and described in details in Qi et al. (2015). 

Once we obtain all the estimates Si, •••, olr-i, we build the classification rule which 
assigns a new observation x to class i if 

(x - x,;) t D(x - Xj) < (x - Xj) t D(x - xQ, (3.9) 

for all 1 < j 7 ^ % < K, where 

D = (qi, • • • , ock- l) K 1 (qi, • • • , q:a-_i) T , (3.10) 

and K is a symmetric (K — 1) x (K — 1) matrix with the (i, j)-th entry equal to a/Sctj. 
This choice of D allows us to achieve a better convergence rate than D used in the classic 
Fishers discriminant analysis rule (2.9). 

In Qi et al. (2015), we proposed to estimate a, by solving 

max odBo:, subject to a T Sa + r||a||^ = 1, adBaq = 0, j < i, (3.11) 

where we used the unthresholded vector BcQ in the linear constraints. That method has 
a good empirical performance, but we cannot provide the theoretical results due to the 
difficulty mentioned at the beginning of Section 3. 

4 Asymptotic consistency and asymptotic optimality 

In this section, we will provide the asymptotic results of the method described in Section 3. 
We first consider two mechanisms of class label generation. The first is a random mechanism 
in which sample observations are randomly drawn from any of K classes with equal prob¬ 
ability 1/K. Hence, ( 711 , 712 , ,7 Ir) follows a multinomial distribution with parameters n 
and (1/K, • • • , 1/K). In this case, we have the following result. 
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Lemma 1. Suppose that (ri\, n 2 , • • • , %) follows a multinomial distribution with parameters 
n and (1/A', • • • , 1/A'). Given any ( K,n,p ) satisfying thatp > 2, K < p+ 1 and a/ A' log p/n 
is bounded by some constant d 0 , for any M > 0, we have 


P | max 

Ki<K 


rh 

n 


1 

K 


> C 


logy 

~Kn 


< p 


—M 


(4.1) 


for any C > (M + 3) (do + 1). 


The second mechanism is nonrandom, that is, (ni,ri 2 , • • • ,%) are nonrandom numbers. 
In this case, we will impose the following Condition 1 (a) on these numbers. 


Condition 1. 


(a) . If (ni,n 2 , ■ ■ ■ ,uk) are nonrandom, then there exists a constant Co (independent of n, 

p and K), such that we have maxi<j<x | nt/n — 1/A'| < Co\J\ogp/ ( Kn ) for all large 
enough n. 

(b) . There exists a constant cq > 0 (independent of n, p and K) such that 

Cq ^ A m ^(5I) ^ A ma3 ,(S) ^ Co and max ||/^||oo ^ ^o* 

1 <i<K 


Lemma 1 and Condition 1 (a) ensure that the number of observations in different classes 
do not differ greatly in each of the two mechanisms. The regularity condition for S in 
Condition 1 (b) has been used in Shao et al. (2011) and Cai and Liu (2011). The condition 
about Hj can be achieved by scaling each of the p variables. Under Condition 1, we have 
the following two probability inequalities about ||S — S||oo and ||B — B||oo, which play basic 
roles in our theoretical development. 


Theorem 4.1. Suppose that Condition 1 holds, p > 2, A' < p + 1 and Klogp/n —* 0 as 
n —> oo. Then for any M > 0, we can find C large enough and independent of n, p and K 
such that 



> C 


K logy 


n 


< p 


—M 



B 


> C 


K logy 


n 


< p 


-M 


for all large enough n. 
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Remark 1. Theorem 4-1 holds even if K —» oo as n —> oo. However, since we need the 
condition that K is bounded in the following theorems, we fix K in this paper. 

Define a p x p nonnegative definite matrix 

E = S' 1/2 BS~ 1/2 . (4.2) 

Solving the generalized eigenvalue problem (2.4) is equivalent to computing the eigenvalues 
and eigenvectors of E. In fact, because a*,, 1 < k < K — 1, are the generalized eigenvectors 
of the problem (2.4), we have 

Bat = z/fcSctfc, and hence, HS^ 2 a/i = (4.3) 

for any 1 < k < K — 1, where is the corresponding generalized eigenvalue. Therefore, 

7r = S 1 / 2 a 1 , 72 = S 1 / 2 a 2 , •••, 7 ^ = 1%-!, (4.4) 

are the eigenvectors of E with corresponding eigenvalues iq, z/ 2 ,..., vk- i, respectively. So 
they are orthogonal to each other. In the following, we will use Afc(E), 1 < k < K — 1, 
to denote the eigenvalues of E, which are just the above generalized eigenvalues and also 
equal to the maximum values of the optimization problems (2.4). Since E has the same 
rank as B which is not greater than K — 1 due to the constraint (2.2), E has at most 
K — 1 positive eigenvalues. By the conditions a^Saj. — 1, 1 < k < K ■— 1, we have 

II7J2 = II72II2 = • • • = H7A--1II2 = 1 - Let 

7i = S i/ 2 ai, 7 2 = £ 1 / 2 S 2 , • • • , 7if-i = S i/ 2 S A '_ 1 , (4.5) 

which are estimates of 7 X , •••, respectively. Since — 07 is also the solution to the 

optimization problem in (3.1) or (3.5), without loss of generality, we choose the sign of 
such that 7 A . 7 fc > 0, for 1 < k < K — 1. We impose the following regularity conditions on 
the eigenvalues of E. 

Condition 2. There exist positive constants c\, c 2 and C 3 which are all independent of n, 
p and K such that 

(a). Ai(E) > A 2 (E) > • • • > A A '_i(E) > c\, 
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<<>)■»»»{ Al % A , J(S) . • • ■. > <*. 

(c). The ratio between the largest and the smallest eigenvalue satisfies Ai(S)/A^_i(H) < c 3 . 


In the case of K = 2, we will show in Remark 2 (3) that Ai(H) has the same order as 
\\n 2 — /Xi|||. Therefore, roughly speaking, Condition 2 (a) implies that the class means are 
not too close to each other. Condition 2 (b) prevents the cases that the spacing between 
adjacent eigenvalues is too small. Condition 2 (c) excludes the situations where the effects 
of higher order components are dominated by those of lower order components and are 
negligible asymptotically. 

Now we consider the choice of the tuning parameters, r and A, in the penalized opti¬ 
mization problems (3.1) and (3.5). We will show that the choice of A is not essential for the 
asymptotic convergence rates as long as it is asymptotically bounded away from zero. In the 
following theorems, we will choose tuning parameters (r n , A n ), which depend on the sample 
size n, satisfying 


0 < \ n < 1, liminf A n > A 0 , r n — Cs n , where s n = 

n—>oo 


K log p 


n 


(4.6) 


Ao > 0 and C are constants independent of n, p and K. The constant C is chosen based on 
Theorem 4.1 such that for all large enough n, 


C 


C, 


P I ||£ - EHoo > — s n <p \ P ( ||B - BHoo > —s n <p \ 


CJ 


Co. 


-1 


(4.7) 


where C'o = 2(1 + c 1 )/Aq and ci is the constant in Condition 2 (a). Define the event 


n n = j||S - Slloo < r n /C 2 , ||B - B||oo < r„/C 2 } , (4.8) 

then by (4.7) P (Q n ) > 1 — 2 p^ 1 . 


We mainly consider the elements in Tl n in proofs. 

We adopt the same definition of asymptotic optimality for a linear classification rule as 
in Shao et al. (2011), Ca.i and Liu (2011), Fan et al. (2012) and other papers. Let T opt 
denote the optimal linear classification rule (2.5) or (2.6) and Ropt represent its misclas- 
sihcation error rate. Let T be any linear classification rule based on X. The conditional 
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misclassification rate of T given X is defined as 


K 


i= 1 


Rt(X) = P ( {x new belongs to the i-th class but T(x new ) i} 



where x uew is a new observation independent of X. Therefore, Rt(X.) is a function of X. 


Definition 1. Let T be a linear classification rule with conditional misclassification rate 
Rt(X-). Then T is asymptotically optimal if 


RtW 

Ropt 


1 = 0 ,( 1 ). 


(4.9) 


Since 0 < Ropt < -Rt(X) < 1 for any X, (4.9) implies that 0 < Rt(X.) — Ropt = o p (l). 
Hence we have i?r(X) —» Ropt hr probability and E[R t (X.)] — > Ropt, which have been 
used to define the consistency of a classification rule by Devroye et al. (1996) and others. If 
Ropt is bounded away from 0, then Rt(X) — Ropt = o p ( 1) also implies (4.9). However, if 
Ropt —* 0, (4.9) is stronger than Rt(X) — Ropt = o p (l). 

In the following, we will consider the asymptotic properties of our method and assume 
that K is fixed, p —* oo and s n = \JK log p/n —> 0 as n —> oo. The following theorem 
provides an upper bound for the l\ sparsity and the consistency of the estimator Si obtained 
from (3.1). 


Theorem 4.2. Suppose that Conditions 1 and 2 hold. If ||o: 1 ||^s n —* 0 as n,p —» oo, then 
for all large enough n, we have, in Q n , 


arlli < 6||ai||?/A 0 , ||7i - Till! < C' 5 ||ai||^ n , (4.10) 

Si - ailla < c 0 C' 5 ||ai||iS n , 


where Gf is a constant independent of n and p, Ao is the constant in (4.6), and Co is the 
constant in Condition 1 (b). Therefore, Si is a consistent estimate of a. i. 

By Theorem 4.2, the estimate Si has the same order of l\ sparsity as ap and in order 
that Si is consistent, we need ||cki|| f is o(^/n/ logp). In the following, we will consider the 
cases of K = 2 and K > 2, separately. 
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4.1 The case of K = 2 


When K = 2, there exists only one component op and E has one postive eigenvalue Ai(E). 
Therefore, Conditions 2 (b)-(c) are not necessary. We provide explicit formulas for the 
misclassification errors of the optimal rule with D = oliolJ and our rule with D = cha^, 
and prove the asymptotic optimality of our method in the following theorem. 

Theorem 4.3. Suppose that K = 2 and Conditions 1 and 2 (a) hold. Then the misclassifi¬ 
cation rate of the optimal rule (2.5) and the conditional misclassification rate of our sparse 
LDA rule in Section 3.1 are given by 

Ropt = 4> ^ 

fl(X) = it 

respectively, where <f> is the cumulative distribution function of the standard normal distri¬ 
bution, S = /x 2 — n 1 and S = x 2 — x^. Moreover, if A 1 (E)||a 1 ||fs n —> 0 as n,p —> oo, our 
method is asymptotically optimal and we have 

|2h_i = 0 ,(A 1 (B)||a 1 ||; S „). (4.12) 

it OPT 

Remark 2. 

(1) . The misclassification rate of the optimal rule is expressed as Ropt = 3* V<f r X _1 <5/2^ 

in Equation (1) in Shao et al. (2011) and Equation (5) in Cai and Liu (2011). Since 
by Lemma ?? (Supplementary Material), = D<5, the Ropt in (4-11) is the same 
as in those papers. 

(2) . Under the l\ sparsity on S _1 <5, Cai and Liu (2011) obtained the convergence rate 

- 1 = O p { (IIE-MII^ + IIS-MII?) /hp} . (4.13) 

in their Theorem 3, where A. p = 6 T £ _1 <5. When K = 2, op = S _1 <5/v / <5 T S~ 1 (5. 
By (??) (Supplementary Material) in the proof of Theorem f.3, we have d T Dd = 


8 t B5 


2||<5 t D£ 1/2 | 


(4.11) 


<5 D(2/x 2 — xi — X 2 
2||5 I DS 1/2 || 2 



<5 D(xi + x 2 - 2 fj.fi) 
2||5 r DS 1/2 || 2 
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<5 i S <5 = 4Ai(H). Hence, our convergence rate on the right hand side of ( 4.12) is 


O p (A 1 (S)||a 1 ||iS re ) = O p |(< S > T £~ 1 <5) 

= o v ( iis-mii; 


E _1 <5 

2 / K log p 1 

VS t 'E~ 1 S 

1 V n J 


1 c||2 , /logp 


n 


Compared to the convergence rate in (4.13), our convergence rate does not have the 
first term in (4.13). 


4.2 The case of K > 2 


We first illustrates the relationship between sparsity assumptions on Sap, • • • , S olk-i and 
{jjl, — fXp 1 < i j < K} in the following lemma. 


Lemma 2. Suppose that Conditions 1-2 hold. Then we have 


1 

(K-l)c oy /2KX^E) 


( max 11 ii — m >■ I I i I < max IlSadb 

\f<i±3<K J ) l<i<K-l 


< 


y'Os 


( max IIMi — Mi Hi 
\1 3 


Since Ai(S) > c\ by Condition 2 (a), Lemma 2 implies that if Ai(H) is bounded from the 
above, then maxi<j<x-i USajHx has the same order as ma||Mi — My Hi- If Ai(S) —* 
oo, we have max^^x-i ||Sai||i/ maxi <i^j<K || Mi — My 11 i 0- Therefore, making sparsity 
assumptions on Sop, • • • , Sa^-i is equivalent to or weaker than assuming the sparsity of 
{Mi — M pi < i j < K} in l\ norm. 

We define the following measurement of sparsity on a* and Sa,, 1 < i < K — T: 


A p = max {||ai||i, ||Sa<||i}. (4.14) 

In the following theorem, we show that for each 1 < i < KH l, the l\ sparsity of the estimate 
a, is bounded by A p multiplied by a constant which does not depend on n and p, and a, is a 
consistent estimate. Moreover, we show that the subspace spanned by • • • , is a con¬ 
sistent estimate of the subspace spanned by {Ba^ • • • , Ba,} (or equivalently the subspace 
spanned by {Sqi, • • • , Sa,}) and provide the convergence rates, where £ ■ is the solution 
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to (3.3). In this paper, to measure whether two subspaces with the same dimensions in 
are close to each other, we use the operator norm of the difference between the orthogonal 
projection matrices onto the two subspaces. 

Theorem 4.4. Suppose that Conditions 1-2 hold. We choose the tuning parameter in the 
optimization problem (3.3) as K n = CAi(S)A p s n , where C is a constant large enough and 
independent of n and p. For any 1 < i < K — 1, let Q, and Q; be the orthogonal projection 
matrices onto the following subspaces of M2, respectively, 

W, = spanj^,^, • • • ,£J, W. t = span{£,. | 2 , • • • , $•}, (4.15) 

where £, ;: = B«j = Aj(E)£aj. If A p s n —» 0 as n,p —* oo, then for each 1 < i < K — 1, there 
exist constants D i: i, D i>2 and D i3 independent of n and p such that in Q n , 

||St||i < A:,iAp, ||an - at ,||| < A, 2 Ajs n , ||Qi - Qi || 2 < D^ 3 A 2 p s n . (4.16) 

Hence, for each 1 < i < K — 1, cti is a consistent estimate of oli, and the projection matrix 
Qi is a consistent estimate o/Q;. 

Based on Theorem 4.4, we will prove the asymptotic optimality of our classification rule 
and provide the corresponding convergence rate. When K > 2, the classification boundary 
of a linear classification rule is typically complicated and no explicit formula for the error 
generally exist. In the following, we first prove a theorem which provides the conditions 
for asymptotic optimality and the corresponding convergence rates for a large family of 
linear classification rules. Then by applying the general result to our method, we obtain the 
asymptotic optimality results. 

We consider a family of linear classification rules motivated by the following observation. 
The optimal classification rule Tqpt can be rewritten in the following way. Let 

a f i = 5r 1/2 (^. - /A, b ji = + /t), (4.17) 

where 1 < i, j < K. Then Tqpt assigns a new observation x to the ith class if abS~ 1/,2 (x — 
bji) < 0 for all j i. Based on this observation, we consider a family of classification rules 
having the form, 

T : to assign a new x to the ith class if a];s" 1/2 (x-b J .)< 0, for all j i, (4.18) 
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where a r , and b ]t are p -dimensional vectors which may depend on the sample X, and satisfy 


&ji — 


Hji 


hji = 


J lj > 


(4.19) 


for all 1 < i 7 ^ j < Ah Typically, and bj, are estimates of a^ and bjj, respectively. In 
addition to the optimal rule, many linear classification rules in practice belong to this family. 
For example, the classic Fisher’s rule (2.9) is of the form (4.18) with a^ = X l,/2 D(x J — x,) 
and hji = |(xj +Xj). The rule of our sparse Fisher’s discriminant analysis method is also a 
special case of (4.18) with 

a ji = S i/2 D(xj - Xj), h^. = ^(xj + Xj), (4.20) 

where D is defined in (3.10). Now we study the asymptotic optimality of a classification 
rule T in this family. It is relatively easy to calculate the convergence rates of a ji and hji 
in a given T. We will establish the asymptotic optimality of T and the convergence rate 
for At(X) / Ropt — 1 based on the convergence rates of a^ and h :)l , where At(X) is the 
conditional misclassihcation rate of T given the training sample X. 


Theorem 4.5. Suppose that Conditions 1 and 2 hold and the general classification rule T 
in (4.18) satisfies: a ji = —a \j and h ]t = b y. Let {S n : n > 1} be a sequence of nonrandom 
positive numbers with 8 n —> 0 and Ai(E)<5 n —> 0 as n —> oo. For any 1 < j i < K, let 


&ji tjii “t” 




(4.21) 


be an orthogonal decomposition of a^, where t^a^ is the orthogonal projection of a.ji along 
the direction of&ji, tji is a real number, and (a,,-j)_|_ is orthogonal to t^a^. Let 


dji = ajE 1/2 (h^ - nf) , dji = ajE 1/2 (b^ - Hi) = 


2 

\Ull 2 ' 


If the following conditions are satisfied. 


*ji\ 12 


112 II II 2^P^fi) j Iji 1 T O p (5 n ), 


dji dji j a.ji\\ 2 O p (5 n ), 


(4.22) 


(4.23) 


then the classification rule T is asymptotically optimal and we have 

RtOQ 

Ropt 


— - 1 = O, (vA,(B«.l 0 gKA,(B».}-']) . 

DPT V / 


(4.24) 
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To apply Theorem 4.5 to a specific linear classification rule with the form (4.18), we need 
to determine the sequence 5 n and verify the conditions (4.23). For our classification rule 
(3.9), which is a special case of (4.18) with and b ji as given in (4.20), it turns out that 
we can choose S n = A 2 p s n which is the convergence rate in Theorem 4.4. 

Theorem 4.6. Suppose that Conditions 1 and 2 hold, and Ai(3)ApS n — > 0 as n,p —> oo. 
Then our classification rule (3.9) is asymptotically optimal. Moreover, we have 

ITTS "1 = 0, (yMSJA^logKMEJAJ*,}-']) . (4.25) 

Comparing Theorem 4.6 with Theorem 4.3, we find that the convergence rate in (4.25) 
is slower than that for K = 2. This may be clue to the complicated classification boundary 
when K > 2. It is a future direction to investigate whether the convergence rates in Theorems 
4.5 and 4.6 can be improved. 

5 Simulation studies 

In the previous section, we have shown that the revised sparse Fisher’s discriminant anal¬ 
ysis method with soft thresholding (SFDA-threshold) has good theoretical properties. In 
this and the following section, we will show that SFDA-threshold also has good predictive 
performance as the original method (SFDA) in Qi et al. (2015) by comparing them with 
regularized discriminant analysis (RDA) (Guo et al. (2007), R package “rda”) and penalized 
discriminant analysis (PDA) (Witten and Tibshirani (2011), R package “penalizedLDA'’) 
through simulation studies and applications to real data sets. 

Three simulation models are considered. In each simulation, 50 independent data sets 
are simulated each of which has 1500 observations and three classes. In each dataset, for 
each observation, we randomly select a class label and then generate the value of x based on 
the distribution of that class. Then the 1500 observations in each dataset are randomly split 
into the training set with 150 observations and the test set with 1350 observations. There 
are 500 features (p = 500) in these datasets. For our methods, SFDA-threshold and SFDA, 
we use the usual cross-validation procedure to select tuning parameters r from {0.5,1, 5,10}, 
and A from {0.01,0.05,0.1,0.2,0.3,0.4}. For SFDA-threshold, we choose k in (3.3) from the 
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three values which are equal to ||Bciq||i multiplied by 0, 0.001 and 0.01, respectively. For 
RDA and PDA, the default cross-validation procedure in the corresponding packages are 
used. The details of the three simulation studies are provided below. 

(a) . Simulation 1: There is no overlap between the features for different classes. There are 

correlations among some feature variables. Specifically, let x tJ be the i th observation 
on the j th variable, 1 < j < 500 and 1 < i < 1500. If the i th observation is in 
class k(= 1,2,3), then x^ = Hkj + Z. t + if 1 < j < 30, and x tJ = /Xkj + e t] if 
j > 31, where Zi ~ Normal(0,1) and e 3 j ~ Normal(0, a 2 ) are independent. Here 
/i ij ~ Normal(l, 0.8 2 ) if 1 < j < 20, /J, 2 j ~ Normal(4, 0.8 2 ) if 21 < j < 30, /i, 3j ~ 
Normal(l, 0.8 2 ) if 31 < j < 50 and fikj = 0 otherwise. We consider the cases that 
a 2 = 1, 1.5 2 and 4, respectively. 

(b) . Simulation 2: There are overlaps between the features for different classes and the vari¬ 

ables are correlated. The i th observation, x* = (x*i, x t2 , ■ ■ ■ , 500 ) ~ Normal(/z fc , S), 

where = (fJ>k,i, Hk, 2 , • • • , Rfc, 500 ), if it is in class k, 1 < k < 3. The covariance 
matrix S is block diagonal, with five blocks each of dimension 100 x 100. The five 
blocks are the same and have (j, j') element O.ObW x a 2 . Also, nij ~ Normal(l,l), 
[i 2 j ~ Normal(2,1) and /i-^- ~ Normal(3,1) if 1 < j < 10 or 101 < j < 110 and ix^j = 0 
otherwise. We consider a 2 = 1, 2 and 3. 

(c) . Simulation 3: Observations from different classes have different distributions about 

the class means. If the i th observation is in class k, x* ~ Normal(/z fc , S*,). We take 
/j>ij = 3 if 1 < j < 10, /i 2 j = 2 if 1 < j < 20, yU 3 j = 1 if 1 < j < 30, and Hk 3 = 0 
otherwise. The covariance matrix is diagonal with the diagonal elements generated 
from the uniform distribution in (0.5,2) x a 2 . S2 is block diagonal, with five blocks 
each of dimension 100 x 100. The blocks have (j,j r ) element 0. x a 2 . And S 3 is 
block diagonal, with five blocks each of dimension 100 x 100. The blocks have (j, j') 
element 0.6 x a 2 if j ^ j' and a 2 otherwise. We consider o 2 = 1, 2 and 3. 

The mean misclassification rates (percentages) of 50 data sets for each simulation are shown 
in Table 1, with standard deviations in parentheses. The PDA has the highest misclassi- 
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fication rate in all simulations. SFDA-threshold performs similarly with SFDA and both 
methods have good prediction accuracies in all the simulations. 


Table 1: The averages and standard deviations (in parentheses) of the misclassihcation rates 
(%) for the simulations in Section 5. 



a 2 

SFDA-threshold 

SFDA 

RDA 

PDA 

Simulation 1 

1 

0.21(0.26) 

0.24(0.26) 

0.32(0.39) 

2.37(1.46) 


1.5 2 

1.52(0.77) 

1.54(0.71) 

1.75(0.96) 

5.40(2.07) 


4 

8.78(4.06) 

8.60(3.71) 

10.20(4.41) 

12.73(4.32) 

Simulation 2 

1 

0.48(0.43) 

0.48(0.47) 

0.79(0.73) 

0.86(0.57) 


2 

3.15(2.40) 

3.29(2.38) 

3.61(2.15) 

4.84(2.45) 


3 

5.05(2.57) 

5.10(2.43) 

6.05(2.99) 

8.55(3.52) 

Simulation 3 

1 

4.86(1.12) 

4.85(1.12) 

7.71(2.03) 

9.51(4.20) 


2 

13.02(2.73) 

12.84(2.79) 

18.74(2.84) 

20.42(5.72) 


3 

21.49(3.45) 

21.48(3.35) 

26.56(3.58) 

29.74(7.61) 


6 Application to multivariate functional data 

With the advance of techniques, multiple curves can be extracted and recorded simultane¬ 
ously for one subject in a single experiment. In this section, we consider two real datasets 
where observations are classified into multiple categories and for each subject, multiple curves 
were measured. We first apply the wavelet transformation to those curves, and then apply 
our method to the obtained wavelet coefficients. The setting for the tuning parameters is 
the same as that in the simulation studies. 

6.1 Daily and sports activities data 

This motion sensor data set, available in UCI Machine Learning Repository (Bache and Lich- 
man, 2013), recorded several daily and sports activities each performed by 8 subjects (Altun 
et al, 2010; Barshan and Yiiksek, 2013; Altun and Barshan, 2010) in 60 time segments. Nine 
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sensors (x, y, z accelerometers, x, y, z gyroscopes, x, y, z magnetometers) were placed on each 
of five body parts (torso, right arm, left arm, right leg, left leg) and calibrated to acquire 
data at 25 Hz sampling frequency. Therefore, for each activity, there are 480 observations. In 
each observation, 45 curves are recorded and each of them has 125 discrete time points. The 
purpose of the study is to build a classification rule to identify the corresponding activity 
based on the observed curves. 

We first apply the Fast Fourier Transformation to each of 45 curves to convert it from 
time domain to the frequency domain and get its spectrum curve. After filtering out the 
higher frequency, we use the first 64 frequency points for each of 45 frequency curves. Then 
we apply wavelet transformation with 64 wavelet basis functions to each of 45 spectrum 
curves and obtain 64 wavelet coefficients. In this way, for each observation, a vector with 
64 x 45 = 2880 wavelet coefficients is obtained as the features to make classifications. 

We consider nine activities which can be divided into three groups. Group 1 includes 
three activities: walking in a parking lot, ascending and descending stairs; Group 2 has 
three activities: running on a treadmill with a speed of 8 km/h, exercising on a stepper and 
exercising on a cross trainer; Group 3 includes rowing, jumping and playing basketball. We 
will consider seven classification problems. In each of the first three problems, we consider 
the classification of the three activities in each of the three groups. In each of the next three 
problems, we combine any two of the three groups and consider the classification of the six 
activities in the combined groups. The last problem is the classification of all nine activities. 
In each problem, for each class, we randomly select 30 observations as the training sample 
and all the other 450 observations as the test sample. The procedure is repeated 50 times 
for each of the seven problems and the averages and standard deviations of misclassihcation 
rates are reported in Table 2. SFDA-threshold performs similarly with SFDA and both 
methods have higher prediction accuracies than RDA and PDA in all cases. 

6.2 Australian sign language data 

The data is available in UCI Machine Learning Repository and the details of the experiments 
can be founded in Kadous (2002). This data set consists of samples of Auslan (Australian 
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Table 2: The averages and standard deviations (in parentheses) of the misclassification rates 
(%) for the daily and sports activities data. 


Classes included 

SFDA-threshold 

SFDA 

RDA 

PDA 

Group 1 

0.23(0.23) 

0.23(0.23) 

1.94(1.91) 

1.96(2.10) 

Group 2 

0.14(0.43) 

0.14(0.44) 

0.58(0.66) 

0.21(0.58) 

Group 3 

0.12(0.07) 

0.12(0.08) 

0.58(1.08) 

0.23(0.36) 

Group 1+2 

0.45(0.44) 

0.46(0.43) 

1.13(0.79) 

2.39(1.52) 

Group 1+3 

1.50(0.84) 

1.54(0.96) 

1.92(0.99) 

4.79(2.33) 

Group 2+3 

0.53(0.26) 

0.54(0.24) 

1.06(0.72) 

0.80(0.37) 

Group 1+2+3 

1.63(0.60) 

1.53(0.63) 

1.78(0.65) 

4.20(2.01) 


Sign Language) signs. Twenty seven examples of each sign were captured from a native 
signer using high-quality position trackers and instrumented gloves. This was a two-hand 
system. For each hand, 11 time series curves were recorded simultaneously, including the 
measurements of x, y, z positions, the direction of palm and five finger bends. The frequency 
curve of each of the 22 curves were extracted by the Fast Fourier Transformation and then 
were transformed by 16 wavelet basis functions. Hence, for each sign, we obtained 352 
features. We choose nine signs and divide them into three groups: Group 1 contains the 
three signs with meanings “innocent”, “responsible” and “not-my-problcm”, respectively; 
Group 2 contains “read”, “write” and “draw”; Group 3 contains “hear”, “answer” and 
“think”. As in the previous example, we consider seven classification problems. For each 
class, we randomly choose 20 observations as the training sample and the other 7 as the 
test sample. The procedure is repeated 50 times and the averages and standard deviations 
of misclassification rates are reported in Table 3. As in previous studies, SFDA-threshold 
performs similarly with SFDA and both methods have higher prediction accuracies than 
RDA and PDA in all cases. 
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Table 3: The averages and standard deviations (in parentheses) of the misclassification rates 
(%) for the Australian sign language data. 


Classes included 

SFDA-threshold 

SFDA 

RDA 

PDA 

Group 1 

0(0) 

0(0) 

1.24(2.32) 

0.19(0.94) 

Group 2 

0(0) 

0(0) 

1.43(2.76) 

4.57(5.61) 

Group 3 

1.24(2.11) 

1.14(2.05) 

3.05(3.94) 

3.9(6.5) 

Group 1+2 

0.19(0.65) 

0.62(1.26) 

0.76(1.31) 

3.81(2.93) 

Group 1+3 

0.81(1.71) 

0.62(1.16) 

1.29(1.94) 

2.24(2.38) 

Group 2+3 

0.93(1.45) 

1.06(1.68) 

1.72(2.13) 

5.16(4.34) 

Group 1+2+3 

0.73(1.02) 

0.57(0.95) 

1.14(1.11) 

6.0(2.78) 
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